Exposing Inner Kernels and Block Storage for Fast Parallel Dense Linear Algebra Codes
Author
Abstract
Efficient execution on processors with multiple cores requires the exploitation of parallelism within the processor. For many dense linear algebra codes this, in turn, requires the efficient execution of codes that operate on relatively small matrices. Efficient implementations of the dense Basic Linear Algebra Subroutines (BLAS) exist. However, calls to BLAS libraries introduce large overheads when they operate on small matrices. High performance implementations of parallel dense linear algebra codes can therefore be achieved by replacing calls to standard BLAS libraries with calls to specialized inner kernels that work efficiently on small data submatrices.
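To make the idea above concrete, here is a minimal sketch (not from the paper; the kernel name and the 4x4 tile size are illustrative assumptions) of a specialized inner kernel. Because the loop bounds are compile-time constants, the compiler can fully unroll and vectorize the nest, avoiding the argument checking and dispatch overhead that a general BLAS dgemm call pays on such a small problem.

```c
/* Hypothetical specialized inner kernel: C += A * B on a fixed 4x4 tile.
   Fixed bounds let the compiler specialize the whole loop nest. */
#define MICRO_NB 4

void micro_dgemm_4x4(double A[MICRO_NB][MICRO_NB],
                     double B[MICRO_NB][MICRO_NB],
                     double C[MICRO_NB][MICRO_NB])
{
    for (int i = 0; i < MICRO_NB; ++i)
        for (int k = 0; k < MICRO_NB; ++k)   /* ikj order: unit stride on B and C */
            for (int j = 0; j < MICRO_NB; ++j)
                C[i][j] += A[i][k] * B[k][j];
}
```

In a blocked factorization, a kernel of this shape would be invoked once per small submatrix update in place of a library call.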
Related works
New Data Structures for Matrices and Specialized Inner Kernels: Low Overhead for High Performance
Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This approach, however, achieves suboptimal performance due to the overheads associated with such calls. Taking as an example the dense Cholesky factorization of a symmetric positive definite matrix, we show that the potential of non-canonical data structures for dense linear algebra can be better exploited with the u...
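One common non-canonical layout in this spirit (a sketch; the tile size and function name are illustrative assumptions, with the tile size taken to divide n) is square-block, or tiled, storage: an n x n matrix is kept as contiguous NB x NB tiles, so each tile an inner kernel touches is a single contiguous run of memory.

```c
#include <stddef.h>

#define TILE_NB 4   /* assumed tile size; TILE_NB must divide n */

/* Map canonical coordinates (i, j) of an n x n matrix to the offset of
   that element in a square-block (tiled) storage array: tiles are laid
   out row by row, and elements within a tile are row-major. */
size_t tiled_index(size_t n, size_t i, size_t j)
{
    size_t tiles_per_row = n / TILE_NB;
    size_t tile = (i / TILE_NB) * tiles_per_row + (j / TILE_NB);
    size_t within = (i % TILE_NB) * TILE_NB + (j % TILE_NB);
    return tile * (TILE_NB * TILE_NB) + within;
}
```

With this mapping, the NB x NB submatrix a kernel works on starts at `tile * TILE_NB * TILE_NB` and is contiguous, which is exactly the property specialized inner kernels exploit.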
Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination
We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared memory architectures. Contrary to the classical cubic algorithms in parallel numerical linear algebra, we focus here on recursive algorithms and coarse-grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms, making coarse...
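The recursive elimination pattern the abstract refers to can be sketched as follows (a simplified illustration, not the authors' code: no pivoting, power-of-two n, row-major storage with leading dimension ld). The trailing update is one large matrix product, which is where sub-cubic multiplication, and coarse-grain parallelism, can be plugged in.

```c
#include <stddef.h>

/* Recursive LU sketch: factor the leading quadrant, apply triangular
   solves to the off-diagonal quadrants, update the trailing quadrant
   with a single matrix product, then recurse on it.
   L is unit lower triangular (stored strictly below the diagonal),
   U is upper triangular (stored on and above the diagonal). */
void rec_lu(size_t n, size_t ld, double *A)
{
    if (n == 1) return;                       /* 1x1 block: nothing to do */
    size_t h = n / 2;
    double *A11 = A,          *A12 = A + h;
    double *A21 = A + h * ld, *A22 = A + h * ld + h;

    rec_lu(h, ld, A11);                       /* A11 = L11 * U11 */

    /* A12 <- L11^{-1} * A12 (unit lower triangular solve) */
    for (size_t i = 1; i < h; ++i)
        for (size_t k = 0; k < i; ++k)
            for (size_t j = 0; j < h; ++j)
                A12[i * ld + j] -= A11[i * ld + k] * A12[k * ld + j];

    /* A21 <- A21 * U11^{-1} (upper triangular solve) */
    for (size_t j = 0; j < h; ++j)
        for (size_t i = 0; i < h; ++i) {
            for (size_t k = 0; k < j; ++k)
                A21[i * ld + j] -= A21[i * ld + k] * A11[k * ld + j];
            A21[i * ld + j] /= A11[j * ld + j];
        }

    /* A22 <- A22 - A21 * A12: the coarse-grain product update */
    for (size_t i = 0; i < h; ++i)
        for (size_t k = 0; k < h; ++k)
            for (size_t j = 0; j < h; ++j)
                A22[i * ld + j] -= A21[i * ld + k] * A12[k * ld + j];

    rec_lu(h, ld, A22);                       /* factor the trailing quadrant */
}
```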
Automatic Generation of Block-Recursive Codes
Block-recursive codes for dense numerical linear algebra computations appear to be well suited for execution on machines with deep memory hierarchies because they are effectively blocked for all levels of the hierarchy. In this paper we describe compiler technology to translate iterative versions of a number of numerical kernels into block-recursive form. We also study the cache behavior and perf...
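The block-recursive form this abstract describes can be sketched for matrix multiplication as follows (an illustrative sketch, not the paper's generated code; power-of-two n, row-major storage with leading dimension ld). Each level halves the working set, so the recursion is implicitly blocked for every cache level at once.

```c
#include <stddef.h>

/* Block-recursive C += A * B: split all three operands into quadrants
   and recurse; a tiny iterative kernel handles the base case. */
void rec_mm(size_t n, size_t ld,
            const double *A, const double *B, double *C)
{
    if (n <= 2) {                             /* base case: iterative kernel */
        for (size_t i = 0; i < n; ++i)
            for (size_t k = 0; k < n; ++k)
                for (size_t j = 0; j < n; ++j)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    size_t h = n / 2;
    const double *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B + h * ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C + h * ld + h;

    rec_mm(h, ld, A11, B11, C11);  rec_mm(h, ld, A12, B21, C11);
    rec_mm(h, ld, A11, B12, C12);  rec_mm(h, ld, A12, B22, C12);
    rec_mm(h, ld, A21, B11, C21);  rec_mm(h, ld, A22, B21, C21);
    rec_mm(h, ld, A21, B12, C22);  rec_mm(h, ld, A22, B22, C22);
}
```

A compiler doing this transformation starts from the three-loop iterative version and derives the eight quadrant calls mechanically.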
Compiler-Optimized Kernels: An Efficient Alternative to Hand-Coded Inner Kernels
The use of highly optimized inner kernels is of paramount importance for obtaining efficient numerical algorithms. Often, such kernels are created by hand. In this paper, however, we present an alternative way to produce efficient matrix multiplication kernels based on a set of simple codes which can be parameterized at compilation time. Using the resulting kernels we have been able to produce ...
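The parameterize-at-compilation-time idea can be sketched like this (illustrative only; the macro and kernel names are assumptions, not the paper's code). One simple source file yields a family of specialized kernels, one per block size supplied on the compiler command line (e.g. -DPK_NB=8), and the best-performing variant is selected empirically.

```c
/* Block size fixed at compilation time, so the compiler sees constant
   loop bounds and can unroll and vectorize for each candidate value. */
#ifndef PK_NB
#define PK_NB 8   /* assumed default block size */
#endif

/* C += A * B on PK_NB x PK_NB row-major blocks. */
void pk_matmul(const double *A, const double *B, double *C)
{
    for (int i = 0; i < PK_NB; ++i)
        for (int k = 0; k < PK_NB; ++k)
            for (int j = 0; j < PK_NB; ++j)
                C[i * PK_NB + j] += A[i * PK_NB + k] * B[k * PK_NB + j];
}
```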
Prototyping Parallel LAPACK using Block-Cyclic Distributed BLAS
Given an implementation of Distributed BLAS Level 3 kernels, the parallelization of dense linear algebra libraries such as LAPACK can be easily achieved. In this paper, we briefly describe the implementation and performance on the AP1000 of Distributed BLAS Level 3 for the rectangular r × s block-cyclic matrix distribution. Then, the parallelization of the central matrix factorization and the trid...
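The r × s block-cyclic distribution mentioned above assigns tile (I, J) of the blocked matrix to process row I mod r and process column J mod s. A minimal sketch of the owner computation (the function name is an illustrative assumption):

```c
/* Linear rank, in an r x s process grid, of the process owning tile
   (I, J) under a block-cyclic distribution: tiles cycle over process
   rows and columns independently. */
int block_cyclic_owner(int I, int J, int r, int s)
{
    return (I % r) * s + (J % s);
}
```

Cycling tiles over the grid balances the load as a factorization's active trailing submatrix shrinks, which is why the block-cyclic layout suits LAPACK-style algorithms.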